Building Diverse Skillsets for Video Game Characters With Adversarial Skill Embeddings
In this article, we explore using large-scale reusable adversarial skill embeddings for physically simulated characters.
Humans are capable of performing an awe-inspiring variety of complex tasks by drawing on our vast repertoire of motor skills. This repertoire is built over a lifetime of interaction with the environment, leading to general-purpose skills that can be widely reused to accomplish new tasks.
Notably, this is at odds with conventional practice in physics-based character animation and reinforcement learning, where control policies are typically trained from scratch to specialize in a single specific task. Developing more versatile and reusable models of motor skills could enable agents to solve tasks that would otherwise be prohibitively challenging to learn from scratch. However, manually constructing a sufficiently rich set of tasks and reward functions that can give rise to behaviors as diverse and versatile as those of humans would require an immense engineering effort. This leads us to the fundamental question we'll be looking at today:
How then can we endow agents with large and versatile repertoires of skills?
The authors of the paper ASE: Large-Scale Reusable Adversarial Skill Embeddings for Physically Simulated Characters attempt to answer just this.
They draw inspiration from the domains of computer vision and natural language processing, where large, expressive models trained on massive datasets have been a central component of major advances. Not only can these models (think EfficientNet and GPT-3) solve challenging tasks, but they also provide powerful priors that can be reused for a wide range of downstream applications.
The authors apply a similar data-driven paradigm to developing more general and versatile motor skills in a virtual agent. Rather than laboriously designing a rich set of training tasks that leads to a flexible range of behaviors, the agent is provided with a large unstructured motion dataset containing examples of the behaviors we would like it to acquire. The agent is then trained to perform a large variety of skills by imitating the behaviors depicted in this dataset. By modeling the learned skills with representations that are suitable for reuse, the authors were able to develop more capable and versatile motor skill models that can be repurposed for a wide range of new applications.
The skills learned by virtue of Adversarial Skill Embeddings can be used to solve a diverse set of downstream tasks while enabling a physically simulated character to produce naturalistic behaviors that resemble the original dataset.
This article was written as a Weights & Biases Report, which is a project management and collaboration tool for machine learning projects. Reports let you organize and embed visualizations, describe your findings, share updates with collaborators, and more. To learn more about reports, check out Collaborative Reports.
Overview of the Framework
The Adversarial Skill Embedding framework consists of two stages: pre-training and transfer.
- During pre-training, a low-level policy is trained to map latent skills to behaviors that resemble the motions depicted in a dataset. The policy is trained to model a diverse repertoire of skills using a reward function that combines an adversarial imitation objective, specified by a discriminator, with an unsupervised skill discovery objective, specified by an encoder.
- After pre-training, the low-level policy can be transferred to new tasks by training a task-specific high-level policy to specify latent variables that direct the low-level policy toward accomplishing a task-specific goal.
An overview of the Adversarial Skill Embedding framework.
The Pre-Training Stage
In the pre-training stage, the objective of the framework is to train a low-level policy to model a large set of skills that resemble natural human behavior. The skills modeled in this stage should be diverse and directable, so that they are general enough to solve a wide range of tasks while also being easy for the high-level policy to control. This is achieved by combining techniques from adversarial imitation learning and unsupervised reinforcement learning. Given a motion dataset, the pre-training objective combines an imitation term with a skill discovery term.
The imitation objective encourages the policy to produce behaviors that resemble the dataset by training a discriminator to predict whether a given motion was produced by the simulated character or drawn from the dataset. The low-level policy is then trained to produce motions that fool the discriminator.
The Imitation Objective
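Concretely, this objective can be approximated with a GAN-style discriminator over state transitions. The snippet below is a minimal PyTorch sketch, assuming placeholder dimensions and a binary cross-entropy formulation; the paper's actual discriminator loss and reward shaping differ in detail.

```python
import torch
import torch.nn as nn

# Minimal sketch of an adversarial imitation term (illustrative only).
# `obs_dim` and the network sizes are placeholders, not the paper's values.
obs_dim = 64
disc = nn.Sequential(
    nn.Linear(2 * obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_transitions, fake_transitions):
    """Train D to output 1 for dataset transitions and 0 for policy transitions."""
    real_logits = disc(real_transitions)
    fake_logits = disc(fake_transitions)
    return bce(real_logits, torch.ones_like(real_logits)) + \
           bce(fake_logits, torch.zeros_like(fake_logits))

def imitation_reward(transitions):
    """Reward the policy for transitions the discriminator judges to be 'real'."""
    with torch.no_grad():
        p_real = torch.sigmoid(disc(transitions))
    return -torch.log(torch.clamp(1.0 - p_real, min=1e-6)).squeeze(-1)

# Example: score a batch of 8 random (s, s') transition pairs.
real = torch.randn(8, 2 * obs_dim)
fake = torch.randn(8, 2 * obs_dim)
print(discriminator_loss(real, fake).item(), imitation_reward(fake).shape)
```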
Next, the skill discovery objective encourages the policy to produce a diverse and distinguishable set of behaviors by maximizing the mutual information between latent variables and their resulting motions. This is achieved by training an encoder to predict the latent variable that produced a particular motion. The low-level policy is then trained to produce distinct behaviors for each latent variable, so that the encoder can more easily recover the original latents.
The Skill Discovery Objective
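The snippet below sketches one way to express this idea in PyTorch, assuming unit-norm skill latents and an encoder that predicts a direction on the hypersphere; the dot product between that prediction and the original latent then serves as a diversity reward. Dimensions, architectures, and the exact reward scaling are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the skill-discovery term (illustrative only).
obs_dim, latent_dim = 64, 16
encoder = nn.Sequential(
    nn.Linear(2 * obs_dim, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),
)

def sample_latents(batch_size):
    """Sample skill latents uniformly from the unit hypersphere."""
    z = torch.randn(batch_size, latent_dim)
    return F.normalize(z, dim=-1)

def encoder_loss(transitions, z):
    """Train the encoder to point back at the latent that produced each transition."""
    pred_dir = F.normalize(encoder(transitions), dim=-1)
    return -(pred_dir * z).sum(dim=-1).mean()

def diversity_reward(transitions, z):
    """Reward the policy when its behavior makes the latent easy to recover."""
    with torch.no_grad():
        pred_dir = F.normalize(encoder(transitions), dim=-1)
    return (pred_dir * z).sum(dim=-1)

z = sample_latents(8)
transitions = torch.randn(8, 2 * obs_dim)
print(encoder_loss(transitions, z).item(), diversity_reward(transitions, z).shape)
```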
The Motion Dataset
The low-level policy is trained to imitate behaviors from a large unstructured motion dataset containing about 30 minutes of motion data. The dataset contains motion clips showing common behaviors, like walking and running, as well as motion clips that depict a gladiator character wielding a sword and shield.
Samples from the Motion Dataset
Large-Scale Training
The authors simulate their characters with Isaac Gym, a high-performance GPU-based simulator from NVIDIA. This enabled large-scale training of the low-level policy on approximately 10 years of simulated experience in just 10 days of real-world time.
Large-scale training of the low-level policy
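To put that figure in perspective, the back-of-the-envelope calculation below shows the aggregate speedup over real time that such a setup has to deliver; the numbers are derived only from the 10-years-in-10-days figure above.

```python
# Back-of-the-envelope for the scale of pre-training (illustrative only).
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

simulated_seconds = 10 * SECONDS_PER_YEAR   # ~10 years of simulated experience
wall_clock_seconds = 10 * SECONDS_PER_DAY   # ~10 days of real-world training time

# Aggregate speedup over real time that the parallel GPU simulation must provide.
speedup = simulated_seconds / wall_clock_seconds
print(f"Required aggregate speedup: ~{speedup:.0f}x real time")
```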
Robust Recoveries
A common failure case for physically simulated characters is losing balance and falling when subjected to unexpected perturbations. In addition to being trained to imitate reference motions, the low-level policy is also trained to recover the character from fallen states. This enables the policy to develop robust recovery strategies that consistently get the character back on its feet after a fall. These recovery strategies then allow the character to automatically recover from perturbations when performing new tasks.
A demonstration of Robust Recoveries by the trained Character
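One simple way to expose the policy to such situations is to occasionally initialize episodes from fallen configurations instead of reference motion frames. The sketch below illustrates that idea with a hypothetical `reset_episode` helper; `env.set_state`, `motion_dataset.sample_frame`, and the initialization probability are assumptions for illustration, not the paper's implementation.

```python
import random

# Illustrative sketch of mixing recovery training into episode initialization.
# The probability below is a placeholder, not the paper's setting.
FALL_INIT_PROB = 0.1

def reset_episode(env, motion_dataset, fallen_states):
    """Start most episodes from a reference motion frame, some from a fallen state."""
    if random.random() < FALL_INIT_PROB:
        # Drop the character into a recorded fallen configuration so the
        # policy is forced to learn how to get back up.
        env.set_state(random.choice(fallen_states))
    else:
        # Otherwise initialize from a randomly sampled frame of the motion data.
        env.set_state(motion_dataset.sample_frame())
```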
Task Training
After pre-training, the low-level policy can be reused to perform new downstream tasks by training a high-level policy to specify latent variables for directing the low-level policy toward completing the desired objectives.
Task-training pipeline for reusing the pre-trained low-level policy on downstream tasks
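The sketch below illustrates this interface in PyTorch: the pre-trained low-level policy is frozen, and a task-specific high-level policy maps the state and goal to a unit-norm latent that the low-level policy decodes into an action. All dimensions and architectures here are placeholders rather than the paper's actual networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the transfer stage (placeholder dimensions).
obs_dim, goal_dim, latent_dim, action_dim = 64, 8, 16, 28

low_level = nn.Sequential(                    # pre-trained, then frozen
    nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
for p in low_level.parameters():
    p.requires_grad = False

high_level = nn.Sequential(                   # trained on the downstream task
    nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),
)

def act(state, goal):
    """High-level policy proposes a skill latent; the low-level policy executes it."""
    z = F.normalize(high_level(torch.cat([state, goal], dim=-1)), dim=-1)
    return low_level(torch.cat([state, z], dim=-1))

action = act(torch.randn(1, obs_dim), torch.randn(1, goal_dim))
print(action.shape)  # torch.Size([1, 28])
```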
Demonstrations of Downstream Tasks
Unlike in the pre-training stage, no motion data is used when training on downstream tasks. Nonetheless, the pre-trained low-level policy allows the character to produce naturalistic motions even in the absence of motion data. And because the low-level policy was trained to recover from perturbations, the character can automatically get back up after falling when performing new tasks.
Demonstration of Target Speed Task
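As a concrete illustration of how such a task might be specified, the snippet below sketches a target-speed style reward that penalizes the gap between the commanded and achieved speed; the exact reward shaping and constants used in the paper may differ.

```python
import torch

def target_speed_reward(actual_speed, target_speed, scale=0.25):
    # Exponentiated squared error between commanded and achieved speed.
    # The shaping and the `scale` constant are illustrative placeholders.
    return torch.exp(-scale * (target_speed - actual_speed) ** 2)

print(target_speed_reward(torch.tensor(1.5), torch.tensor(2.0)))  # tensor(0.9394)
```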